rFaaS: RDMA-Enabled FaaS Platform for Serverless High-Performance Computing
The rigid MPI programming model and batch scheduling dominate
high-performance computing. While clouds brought new levels of elasticity into
the world of computing, supercomputers still suffer from low resource
utilization rates. To enhance supercomputing clusters with the benefits of
serverless computing, a modern cloud programming paradigm for pay-as-you-go
execution of stateless functions, we present rFaaS, the first RDMA-aware
Function-as-a-Service (FaaS) platform. With hot invocations and decentralized
function placement, we overcome the major performance limitations of FaaS
systems and provide low-latency remote invocations in multi-tenant
environments. We evaluate the new serverless system through a series of
microbenchmarks and show that remote functions execute with negligible
performance overheads. We demonstrate how serverless computing can bring
elastic resource management into MPI-based high-performance applications.
Overall, our results show that MPI applications can benefit from modern cloud
programming paradigms to guarantee high performance at lower resource costs.
Work-stealing prefix scan: Addressing load imbalance in large-scale image registration
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling derivation of material properties at nanoscale for long microscopy image series.
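To make the prefix-scan idea concrete, the following is a minimal sketch, not the paper's work-stealing algorithm. It assumes that each pairwise registration can be represented as a transformation and that composing transformations is associative. Here, 1D affine maps stored as hypothetical (a, b) pairs stand in for the real image transformations, and a simple two-phase blocked scan shows why the chain of compositions does not have to be computed strictly left to right.

```python
from itertools import accumulate


def compose(f, g):
    """Compose two 1D affine maps x -> a*x + b, applying f first, then g.

    Composition is associative but not commutative, which is all a
    prefix scan requires of its operator.
    """
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def sequential_scan(items, op):
    """Inclusive scan: out[i] = items[0] op items[1] op ... op items[i]."""
    return list(accumulate(items, op))


def blocked_scan(items, op, num_blocks):
    """Two-phase inclusive scan over contiguous blocks.

    Phase 1 scans every block independently (this part could run in
    parallel). Phase 2 scans the block totals, and each later block is
    then combined with the total of everything that precedes it.
    """
    if not items:
        return []
    size = -(-len(items) // num_blocks)  # ceiling division
    blocks = [items[i:i + size] for i in range(0, len(items), size)]

    local = [sequential_scan(block, op) for block in blocks]
    carries = sequential_scan([scanned[-1] for scanned in local], op)

    result = list(local[0])
    for k in range(1, len(local)):
        carry = carries[k - 1]
        result.extend(op(carry, x) for x in local[k])
    return result


# Hypothetical "registrations" between consecutive images.
maps = [(2, 1), (1, 3), (3, -2), (1, 1), (2, 0), (1, -4), (5, 2), (1, 1)]
assert blocked_scan(maps, compose, 3) == sequential_scan(maps, compose)
```

In this sketch every block costs the same, whereas the abstract describes an imbalanced computation. That imbalance is the motivation for the work-stealing procedure, which this example does not attempt to reproduce.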
Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems either require a full-factorial experiment design or a sampling design based on heuristics. This results in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use the insights to derive the smallest necessary experiment design and avoid repetitions of measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective using two case studies where we reduce the number of measurements from up to 3125 to only 25, decreasing cost to only 2.9% of the previously needed core hours, while maintaining the accuracy of the resulting model at 91.5%, compared to 93.8% when using all 3125 measurements.
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 considered baselines, and it
facilitates developing complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported with
a broad concurrency analysis for portability in performance insights, and a
novel performance metric to assess the throughput of graph mining algorithms,
enabling more insightful evaluation. As use cases, we harness GMS to rapidly
redesign and accelerate state-of-the-art baselines of core graph mining
problems: degeneracy reordering (by up to >2x), maximal clique listing (by up
to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x),
also obtaining better theoretical performance bounds.
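As an illustration of how set intersection can serve as a building block for graph mining, the sketch below counts k-cliques using Python's built-in sets. It is a generic textbook-style formulation under simple assumptions (an undirected graph given as a dictionary of neighbor sets with comparable vertex ids), not the GMS platform's API or one of its optimized baselines. The names `count_k_cliques` and `extend` are hypothetical.

```python
def count_k_cliques(adj, k):
    """Count k-cliques in an undirected graph given as {vertex: set(neighbors)}.

    Edges are oriented from lower to higher vertex id so that each clique
    is discovered exactly once. The candidate set is narrowed with one
    set intersection per vertex added to the partial clique.
    """
    out = {u: {v for v in nbrs if v > u} for u, nbrs in adj.items()}

    def extend(candidates, remaining):
        # 'remaining' is the number of vertices still to be added.
        if remaining == 0:
            return 1
        return sum(extend(candidates & out[v], remaining - 1)
                   for v in candidates)

    return sum(extend(out[u], k - 1) for u in out)


# Complete graph on four vertices: 6 edges, 4 triangles, one 4-clique.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
assert count_k_cliques(k4, 2) == 6
assert count_k_cliques(k4, 3) == 4
assert count_k_cliques(k4, 4) == 1
```

Because the only graph-specific operation is `candidates & out[v]`, the set representation or the intersection routine could be swapped without touching the recursion. The abstract attributes GMS's modularity to this kind of separation.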
FaaSKeeper: a Blueprint for Serverless Services
FaaS (Function-as-a-Service) brought a fundamental shift into cloud
computing: (persistent) virtual machines have been replaced with dynamically
allocated resources, trading locality and statefulness for a pay-as-you-go
model more suitable for varying and infrequent workloads. However, adapting
services to function within the serverless paradigm while still fulfilling
requirements is challenging. In this work, we introduce a design blueprint for
creating complex serverless services and contribute a set of requirements for
efficient and scalable FaaS computing. To showcase our approach, we focus on
ZooKeeper, a centralized coordination service that offers a safe and wait-free
consensus mechanism but requires a persistent allocation of computing resources
that does not offer the flexibility needed to handle variable workloads. We
design FaaSKeeper, the first coordination service built on serverless functions
and cloud-native services. FaaSKeeper provides the same consistency guarantees
and interface as ZooKeeper with a price model proportional to the activity in
the system. In addition, we define synchronization primitives to extend the
capabilities of scalable cloud storage services with consensus semantics
needed for strong data consistency.
FMI: Fast and Cheap Message Passing for Serverless Functions
Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed jobs that benefit from large-scale and dynamic parallelism, the lack of fast and cheap communication is a major limitation. Individual functions cannot communicate directly, group operations do not exist, and users resort to manual implementations of storage-based communication. This results in communication times multiple orders of magnitude slower than those found in HPC systems. We overcome this limitation and present the FaaS Message Interface (FMI). FMI is an easy-to-use, high-performance framework for general-purpose point-to-point and collective communication in FaaS applications. We support different communication channels and offer a model-driven channel selection according to performance and cost expectations. We model the interface after MPI and show that message passing can be integrated into serverless applications with minor changes, providing portable communication closer to that offered by high-performance systems. In our experiments, FMI can speed up communication for a distributed machine learning FaaS application by up to 162x, while simultaneously reducing cost by up to 397x.